Hands-On ANOVA in R

R Coding Club
RTG 2660

Dr. Lea Hildebrandt

2024-04-30

Repeated-Measures ANOVA

Often in our research, we have repeated measures: We collect more than one data point from each subject.

Advantage: increased power and precision.
Disadvantage: violates iid (independent and identically distributed) assumption of ordinary least squares methods (regression, ANOVA).

  • independent: the probability of a data point taking on a specific value is independent of the values taken by other data points.

  • identically distributed: data points are samples from the same underlying distribution.

Problem: Observations from the same subject are usually correlated (i.e. more likely to be similar to each other than to other observations).

–> this also applies to other forms of dependency: if units of observation are clustered, e.g. mice in the same cage/handled by the same experimenter, students in a classroom…

In contrast to violations of other assumptions (normality, homoscedasticity), ANOVA is not robust against this violation! –> increased Type I error/false positives
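This inflation is easy to demonstrate with a small simulation (a minimal base-R sketch; the cage/mouse setup, cage counts, and effect sizes are illustrative assumptions, not from the slides): condition is assigned at the cage level, mice in a cage share a cage effect, and we analyse at the mouse level while ignoring the clustering, even though there is no true condition effect.

```r
set.seed(1)

# 6 cages, 10 mice each; condition assigned per cage (3 cages per condition)
# half of the total variance comes from the cage level (cage sd = residual sd = 1)
n_sims <- 2000
false_positives <- replicate(n_sims, {
  cage      <- rep(1:6, each = 10)
  condition <- factor(rep(c("A", "B"), each = 30))
  # shared cage effect + individual noise -- NO true condition effect
  y <- rnorm(6, sd = 1)[cage] + rnorm(60, sd = 1)
  p <- summary(aov(y ~ condition))[[1]][["Pr(>F)"]][1]
  p < .05
})

mean(false_positives)  # far above the nominal 5% rate
```

Because the naive ANOVA treats all 60 mice as independent, it rejects the (true) null hypothesis much more often than the nominal 5% of the time.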

Differences between repeated-measures and “normal” ANOVA

  • repeated measures: each subject provides several data points.

  • F-statistic is calculated a bit differently - subject variability is removed from the error term: \[ F = MS_{conditions} / MS_{error} \]


Example Sum of Squares partitioning

Imagine we had a wine tasting where every one of you tries six different wines and rates each on a scale from 1-10. This way, we would get the following data set:

Subject Wine1 Wine2 Wine3 Wine4 Wine5 Wine6 Mean_subject
Ecem 5 6 7 6 3 5 5.333
Ulkar 7 8 8 7 6 6 7
Francesco 1 4 3 7 4 4 3.833
Anna 6 4 6 5 3 4 4.667
Nikita 8 7 4 6 9 3 6.167
Zora 2 3 2 10 4 3 4
Mean_wine 4.833 5.333 5 6.833 4.833 4.167 5.167

Example Sum of Squares partitioning - 2

\[ SS_{wine} = n\sum_{j=1}^{k}(\bar{x}_j-\bar{x})^2 \]

For every wine, we subtract the grand mean from the wine’s mean, square the difference, and multiply the sum by the number of subjects.

\[ SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij}-\bar{x_j})^2 \]

For the SSwithin, we subtract each wine’s group mean from every individual data point, square the differences, and sum them up across all wines.

\[ SS_{subjects} = k\sum_{i=1}^{n}(\bar{x_i}-\bar{x})^2 \]

Here, we take each subject’s mean, subtract the grand mean from it, square the difference, and multiply the sum by the number of wines. \[SS_{error} = SS_{within} - SS_{subjects} \]

We can then calculate the Mean Squares by dividing the Sums of Squares by their Degrees of Freedom. The MS are then used to calculate the F-value:

\[ MS_{wine} = \frac{SS_{wine}}{(k-1)} \]

\[ MS_{error} = \frac{SS_{error}}{(n-1)(k-1)} \]

\[F=\frac{MS_{wine}}{MS_{error}}\]
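The whole partitioning can be reproduced on the wine-tasting data in a few lines of base R (a sketch; the variable names are ours):

```r
# ratings: 6 subjects (rows) x 6 wines (columns), from the table above
ratings <- rbind(
  Ecem      = c(5, 6, 7, 6, 3, 5),
  Ulkar     = c(7, 8, 8, 7, 6, 6),
  Francesco = c(1, 4, 3, 7, 4, 4),
  Anna      = c(6, 4, 6, 5, 3, 4),
  Nikita    = c(8, 7, 4, 6, 9, 3),
  Zora      = c(2, 3, 2, 10, 4, 3)
)
n <- nrow(ratings)   # number of subjects
k <- ncol(ratings)   # number of wines

grand_mean <- mean(ratings)
wine_means <- colMeans(ratings)
subj_means <- rowMeans(ratings)

SS_wine     <- n * sum((wine_means - grand_mean)^2)   # ~24.33
SS_within   <- sum(sweep(ratings, 2, wine_means)^2)   # ~134.67
SS_subjects <- k * sum((subj_means - grand_mean)^2)   # ~46.67
SS_error    <- SS_within - SS_subjects                # 88

MS_wine  <- SS_wine  / (k - 1)
MS_error <- SS_error / ((n - 1) * (k - 1))
F_value  <- MS_wine / MS_error                        # ~1.38
```

With F ≈ 1.38 on (5, 25) degrees of freedom, the wines do not differ significantly in this toy example.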

Calculating an ANOVA in R

There are a variety of functions to run an ANOVA in R. We will be using those from the afex package! (Any other is fine as well, but the afex package has some advantages: it sets appropriate contrasts automatically and uses Type 3 sums of squares by default.)

The main function is aov_car() - so named because it runs the Anova() function from the car package. aov_ez() and aov_4() are wrappers around aov_car(): they reproduce the same result, but the input format/syntax differs.

library(tidyverse) # for %>%, select(), pivot_longer(), ggplot()
library(afex)
library(performance) # for checking assumptions
library(emmeans) # for contrasts/post-hoc tests
library(report) # handy tool to help you report your findings

Let’s use a bigger data set:

set.seed(42)
z <- data.frame(ID = 1:300,
                condition  = rep(c("A", "B", "C"), each = 100),
                gender  = factor(rbinom(300, 1, .5)),
                T_1 = c(rnorm(100,2), rnorm(100,1),rnorm(100,0)),
                T_2 = c(rnorm(100,2), rnorm(100,1),rnorm(100,0)),
                T_3 = c(rnorm(100,2), rnorm(100,1),rnorm(100,0)))

With the afex package, we can use the same function to calculate one-way ANOVA, factorial ANOVA, repeated measures ANOVA, and mixed ANOVA. For this aim, we always have to specify the error term (the Error() part of the formula).

E.g. one-way & factorial ANOVA:

aov1 <- z %>%
  aov_car(T_1 ~ condition + Error(ID),
          data = .)
aov1

# using different syntax:
aov2 <- z %>%
  aov_car(T_1 ~ condition * gender + Error(ID),
          data = .)

aov2.1 <- z %>%
  select(T_1, condition, gender, ID) %>%
  aov_ez(id = "ID",
         dv = "T_1",
         between = c("condition", "gender"),
         data = .)

aov2.2 <- z %>%
  aov_4(T_1 ~ condition * gender + (1|ID),
        data = .)

aov2.2

Repeated measures ANOVA with afex

For a rmANOVA (in R with the afex package), we need the data to be in long format:

z_long <- z %>% 
  pivot_longer(cols = starts_with("T"),
               names_to = "timepoint")


# calculate ANOVA

aov_rm <- z_long %>%
  aov_car(value ~ 1 + Error(ID/timepoint),
          data = .)

# I prefer the formula notation:
aov_rm <- z_long %>%
  aov_4(value ~ 1 + (timepoint|ID),
        data = .)
aov_rm
Anova Table (Type 3 tests)

Response: value
     Effect           df  MSE    F   ges p.value
1 timepoint 1.99, 595.97 1.06 0.28 <.001    .754
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

Sphericity correction method: GG 

Mixed ANOVA

We can simply extend our model by adding between-subjects variables:

mixed_mod <- z_long %>%
  aov_4(value ~ condition + gender + (timepoint|ID),
        data = .)
mixed_mod
Anova Table (Type 3 tests)

Response: value
               Effect           df  MSE          F   ges p.value
1           condition       2, 296 0.94 304.08 ***  .387   <.001
2              gender       1, 296 0.94       0.32 <.001    .573
3           timepoint 1.99, 590.22 1.06       0.28 <.001    .756
4 condition:timepoint 3.99, 590.22 1.06       1.23  .006    .297
5    gender:timepoint 1.99, 590.22 1.06       0.24 <.001    .786
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

Sphericity correction method: GG 

Checking assumptions

Unfortunately, we can’t use performance::check_model() (yet) to check the assumptions of a model fit with afex. This means that we have to check the assumptions individually:

  • continuous DV

  • (outliers)

  • sphericity (the variances of the differences between all levels must be equal)

  • normality of residuals

# sphericity
test_sphericity(mixed_mod)  # afex
OK: Data seems to be spherical (p > 0.641).
check_sphericity(mixed_mod) # performance
OK: Data seems to be spherical (p > 0.641).
# normality of residuals
# residuals(mixed_mod)
plot(check_normality(mixed_mod))

Contrasts

We can use the emmeans package to calculate “estimated marginal means”:

emmixed <- mixed_mod %>% 
  emmeans(specs = c("timepoint", "condition")) %>% 
  as_tibble()

ggplot(emmixed, aes(timepoint, emmean, group = condition, color = condition)) +
    geom_point() +
    geom_line() +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
                  width = .2) 

We can use those to calculate pairwise contrasts:

mixed_mod %>% 
  emmeans(specs = c("timepoint", "condition")) %>% 
  pairs(adjust = "tukey")
 contrast      estimate    SE  df t.ratio p.value
 T_1 A - T_2 A  -0.0166 0.142 296  -0.117  1.0000
 T_1 A - T_3 A   0.1751 0.148 296   1.181  0.9601
 T_1 A - T_1 B   1.0370 0.134 296   7.735  <.0001
 T_1 A - T_2 B   1.0652 0.138 296   7.709  <.0001
 T_1 A - T_3 B   0.9146 0.143 296   6.382  <.0001
 T_1 A - T_1 C   1.9918 0.135 296  14.712  <.0001
 T_1 A - T_2 C   2.1386 0.139 296  15.423  <.0001
 T_1 A - T_3 C   1.9307 0.144 296  13.423  <.0001
 T_2 A - T_3 A   0.1916 0.147 296   1.307  0.9286
 T_2 A - T_1 B   1.0536 0.138 296   7.626  <.0001
 T_2 A - T_2 B   1.0818 0.142 296   7.642  <.0001
 T_2 A - T_3 B   0.9312 0.147 296   6.341  <.0001
 T_2 A - T_1 C   2.0084 0.139 296  14.491  <.0001
 T_2 A - T_2 C   2.1551 0.143 296  15.076  <.0001
 T_2 A - T_3 C   1.9473 0.147 296  13.207  <.0001
 T_3 A - T_1 B   0.8619 0.143 296   6.015  <.0001
 T_3 A - T_2 B   0.8901 0.147 296   6.061  <.0001
 T_3 A - T_3 B   0.7396 0.151 296   4.885  0.0001
 T_3 A - T_1 C   1.8168 0.144 296  12.643  <.0001
 T_3 A - T_2 C   1.9635 0.147 296  13.324  <.0001
 T_3 A - T_3 C   1.7557 0.153 296  11.484  <.0001
 T_1 B - T_2 B   0.0282 0.142 296   0.199  1.0000
 T_1 B - T_3 B  -0.1224 0.148 296  -0.825  0.9961
 T_1 B - T_1 C   0.9548 0.136 296   7.044  <.0001
 T_1 B - T_2 C   1.1016 0.139 296   7.942  <.0001
 T_1 B - T_3 C   0.8937 0.144 296   6.212  <.0001
 T_2 B - T_3 B  -0.1506 0.147 296  -1.026  0.9831
 T_2 B - T_1 C   0.9266 0.139 296   6.683  <.0001
 T_2 B - T_2 C   1.0734 0.143 296   7.500  <.0001
 T_2 B - T_3 C   0.8655 0.147 296   5.868  <.0001
 T_3 B - T_1 C   1.0772 0.144 296   7.494  <.0001
 T_3 B - T_2 C   1.2239 0.147 296   8.302  <.0001
 T_3 B - T_3 C   1.0161 0.153 296   6.639  <.0001
 T_1 C - T_2 C   0.1467 0.143 296   1.028  0.9829
 T_1 C - T_3 C  -0.0611 0.149 296  -0.409  1.0000
 T_2 C - T_3 C  -0.2078 0.148 296  -1.406  0.8948

Results are averaged over the levels of: gender 
P value adjustment: tukey method for comparing a family of 9 estimates 

Custom Contrasts

There are a number of ways to specify which contrasts you want in emmeans. Here are two examples: one using a predefined contrast (trt.vs.ctrlk) and one with manually defined contrasts.

# compare everything to "last" level, in this case T_3:
mixed_mod %>% 
  emmeans(specs = trt.vs.ctrlk ~ timepoint)

###
# get emmeans
emm1 <-  emmeans(mixed_mod, specs = ~ timepoint)
emm1

# we need to get the correct order:
T_1_c <- c(1,0,0)
T_2_c <- c(0,1,0)
T_3_c <- c(0,0,1)
T_23_c <- c(0,.5,.5)

# get custom contrasts from list
contrast(emm1, method = list("timepoint 1 - timepoint 3" = T_1_c - T_3_c,
                             "timepoint 1 - timepoint 2" = T_1_c - T_2_c,
                             "tp 1 - mean of tp 2 & 3" = T_1_c - (T_2_c + T_3_c)/2,
                             "tp 1 - mean of tp 2 & 3 (v2)" = T_1_c - T_23_c) )

# same is possible with interactions/several factors
emm2 <-  emmeans(mixed_mod, specs = ~ c(timepoint, condition))
emm2

T_1_A <- c(1,0,0,0,0,0,0,0,0) # etc

Simple Slopes

Sometimes you have a continuous IV in combination with a categorical IV - e.g. if you want to further investigate their interaction. Here, you can’t easily work with contrasts (at which value of the continuous variable would you compare? The mean?). Instead, you can get the slope (of the predicted values) of the continuous IV within each level of the categorical factor. This is called a simple slope. For this aim, you would use the function emtrends().
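A minimal sketch of emtrends() (the data, model, and variable names here are made up for illustration - our mixed_mod above has no continuous predictor):

```r
library(emmeans)

# toy data: continuous covariate "age", categorical "group" (illustrative only)
set.seed(1)
d <- data.frame(
  group = factor(rep(c("A", "B"), each = 50)),
  age   = rnorm(100, mean = 40, sd = 10)
)
# group A has a shallow age slope (0.2), group B a steep one (0.8)
d$y <- ifelse(d$group == "A", 0.2, 0.8) * d$age + rnorm(100)

fit <- lm(y ~ group * age, data = d)

# simple slope of age within each level of group:
emtrends(fit, specs = ~ group, var = "age")

# and a test of whether the slopes differ between groups:
emtrends(fit, specs = pairwise ~ group, var = "age")
```

The first call recovers one slope per group level; the second adds the pairwise contrast between them, which is essentially a test of the interaction.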

(We will cover this later, when we talk about Linear Mixed Models - because you would probably not expect an interaction between a factor/IV and a cont. covariate…)

Thanks!

Next time? May 14th?

Topic? LMM? :D